Non-record: 11L Int5 QAT + Score-First TTT — val_bpb 1.1326 (15.51 MB)#861

Open
JoeProAI wants to merge 3 commits into openai:main from JoeProAI:submission/joeproai-11l-int5-ttt-1.1326

Conversation

@JoeProAI JoeProAI commented Mar 26, 2026

11L U-Net + Int5 QAT + Score-First Legal TTT

3-seed mean val_bpb: 1.13391 (std 0.00153) | 15.51 MB (16,265,723 bytes) | 8xH100 (~37 min)


What's different

Built on the PR #549 stack. Key additions:

  • Int5 QAT — weights quantized to [-15, 15] per-row (stored int8 + float16 scale). Tighter than int6, better zstd compression ratio.
  • Score-first TTT — AdamW on MLP-only params (up_proj, down_proj, gate_proj, scale). lr=0.0004, 1 epoch. Order: score chunk first, then adapt. Legal per the PR #461 recipe (Non-record: 11L Depth Recurrence + High-Yield Legal TTT, 1.14458 BPB).
  • MLP_HIDDEN=1536 — reduced from 1792 to fit artifact under 16 MB with int5.
  • 15% weight pruning — zero smallest weights pre-quantization for better zstd compression.
  • Bigram hash embedding — 4096 buckets, 128-dim, added to token embeddings.
  • XSA on all 11 layers — full U-Net cross-layer shared attention.
  • Warmdown 6000 steps — longer QAT phase for better weight clustering near int5 boundaries.
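
The int5 QAT and pruning levers above can be sketched together. This is a hypothetical illustration, not the PR's actual code: `int5_quantize` and `prune_pct` are names invented here. It zeros the smallest-magnitude 15% of weights (long zero runs compress well under zstd), then quantizes each row to [-15, 15] with a per-row float16 scale, storing values as int8:

```python
import numpy as np

def int5_quantize(w: np.ndarray, prune_pct: float = 0.15):
    """Per-row int5 quantization with magnitude pruning (illustrative sketch)."""
    w = w.astype(np.float32).copy()
    # Prune: zero the smallest-magnitude weights so zstd sees long zero runs.
    k = int(prune_pct * w.size)
    if k > 0:
        thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        w[np.abs(w) <= thresh] = 0.0
    # Per-row scale so the largest |weight| in each row maps to 15.
    scale = (np.abs(w).max(axis=1, keepdims=True) / 15.0).astype(np.float16)
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale.astype(np.float32)), -15, 15).astype(np.int8)
    return q, scale  # stored int8 values + float16 per-row scales

def int5_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct float32 weights from int8 values and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

The [-15, 15] range uses 31 of the 32 int5 levels symmetrically, which is what makes it tighter than int6 while still round-tripping through int8 storage.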

3-Seed Results

| Seed | val_bpb | Artifact |
| --- | --- | --- |
| 42 (submitted artifact) | 1.13256182 | 15.51 MiB |
| 314 | 1.13557402 | 15.60 MiB |
| 2025 | 1.13360681 | 15.59 MiB |
| Mean | 1.13391 | |
| Std | 0.00153 | |

All three seeds trail the official SOTA (#549, 1.1194) by roughly 0.013–0.016 BPB, hence the Non-record label. All artifacts are under 16 MiB.

Architecture

| Param | Value |
| --- | --- |
| Layers | 11 |
| Model dim | 512 |
| Heads | 8 |
| MLP hidden | 1536 |
| Bigram buckets | 4096 |
| Bigram embed dim | 128 |
| Vocab size | 256 |
| Tie embeddings | false |

Rule Compliance

  • Score-first TTT: tokens scored under inference_mode() before training on them
  • No val tokens used in artifact or training
  • No pre-eval adaptation
  • Submitted artifact: 15.51 MiB (under 16 MiB limit)
  • All validation artifacts under 16 MiB
  • Training time: ~37 min | Eval time: ~192s (under 600s budget)
  • 3-seed validation (seeds 42, 314, 2025)
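
The score-first ordering in the first bullet is the crux of TTT legality, and can be sketched framework-agnostically. This is an illustrative loop, not the PR's code (the PR uses `torch.inference_mode()` and AdamW on MLP params; here `model` and `adapt` are hypothetical callables, demonstrated with a toy adaptive unigram byte model):

```python
import numpy as np

def score_first_ttt(chunks, model, adapt):
    """Score-first test-time training loop (illustrative sketch).

    Rule-compliant order: each chunk is scored with the current (frozen)
    parameters FIRST, and only afterwards may the model adapt on that same
    chunk. No chunk ever influences its own score.
    """
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score under frozen weights. `model` returns the chunk's
        #    summed negative log2-likelihood in bits.
        total_bits += model(chunk)
        total_tokens += len(chunk)
        # 2) Only now adapt on the chunk.
        adapt(chunk)
    return total_bits / total_tokens  # bits per byte

# Toy demonstration: an adaptive unigram byte model with Laplace counts.
counts = np.ones(256)

def model(chunk: bytes) -> float:
    probs = counts / counts.sum()
    return float(-np.log2(probs[np.frombuffer(chunk, dtype=np.uint8)]).sum())

def adapt(chunk: bytes) -> None:
    np.add.at(counts, np.frombuffer(chunk, dtype=np.uint8), 1)

bpb = score_first_ttt([b"aaaa", b"aaaa", b"aaaa"], model, adapt)
```

In the toy run, the first chunk is scored at the uniform 8 bits/byte and later chunks get cheaper as the model adapts, so the mean lands strictly between 0 and 8 bits/byte.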

Train log, submission.json, and training script included.

…g to fit int6 under 16MB

- INT6_CLIP_PERCENTILE now reads from env (default 99.99984, wave46 uses 99.0)
- PRUNE_PCT added to 1.0677 script (was missing, wave46 uses 0.25)
- Modal harness wave46_clip_prune.py for detached runs
- Both levers push zeros into weight tensors for better zstd compression
- Base architecture: SwiGLU + U-Net + XSA4 + BigramHash(8192) = 1.0677 BPB pre-compression
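
The clip-percentile lever described above can be sketched as follows. This is a hypothetical reading of the mechanism, not the repo's code: outlier weights are clipped at a percentile read from the `INT6_CLIP_PERCENTILE` env var before quantization, so per-row scales shrink and more small weights round to exact zero for zstd:

```python
import os
import numpy as np

# Clip percentile comes from the environment, defaulting to the near-no-op
# 99.99984 mentioned above (wave46 uses an aggressive 99.0).
CLIP_PCT = float(os.environ.get("INT6_CLIP_PERCENTILE", "99.99984"))

def clip_weights(w: np.ndarray, pct: float = CLIP_PCT) -> np.ndarray:
    """Symmetrically clip weights at the given |weight| percentile."""
    lim = np.percentile(np.abs(w), pct)
    return np.clip(w, -lim, lim)
```

With a smaller max |weight|, the per-row quantization scale shrinks, so near-zero weights quantize to 0 more often, which is exactly the "push zeros into weight tensors" effect both levers share.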